[Feature][Perf] Support Selective CPU Weight Offloading #34535
vllm-bot merged 4 commits into vllm-project:main
Conversation
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Code Review
This pull request introduces a useful feature for selectively offloading model parameters to the CPU, which can significantly improve performance in memory-constrained scenarios, as demonstrated by the provided benchmarks. The implementation is clear and follows existing patterns in the codebase. The changes to the configuration and model loading logic are well-integrated. The parameter name matching logic, while a bit subtle, appears correct and robust for its intended purpose. Overall, this is a solid contribution that enhances the flexibility and performance of vLLM.
@wzhao18 Is it possible to use regex syntax like in, e.g.,
@ehfd Can you share the motivation? I did not go with regex because I want to keep it as simple as possible. If you find cases that cannot be expressed the current way, maybe we should consider supporting regex.
@ehfd Different MoE models name their parameters differently, so relying on fixed regex patterns to identify MoE expert weights may not work. That said, since the naming convention is consistent across layers, I think the current approach is expressive enough for offloading any specific model weights. For any model on Hugging Face, you can check the index file for the weight names - e.g. https://huggingface.co/deepseek-ai/DeepSeek-R1/blob/main/model.safetensors.index.json
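As a concrete illustration of the name-matching idea, here is a hedged sketch (the `should_offload` helper and plain-substring pattern semantics are illustrative assumptions, not vLLM's actual implementation): each parameter name from the checkpoint index is checked against the user-supplied patterns.

```python
# Illustrative sketch of selective offload by parameter-name matching.
# The pattern semantics (plain substring match) are an assumption for
# illustration, not necessarily what --cpu-offload-params implements.

def should_offload(param_name: str, patterns: list[str]) -> bool:
    """Return True if any user-supplied pattern occurs in the name."""
    return any(p in param_name for p in patterns)

# Weight names in the style of a model.safetensors.index.json file.
names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.experts.7.down_proj.weight",
    "model.layers.0.mlp.gate.weight",
]

# With the pattern "experts", only the MoE expert weight is selected
# for CPU offloading; attention and router weights stay on the GPU.
offloaded = [n for n in names if should_offload(n, ["experts"])]
```

Because expert-weight naming is consistent across layers within a given model, a single substring like "experts" is enough to select every expert weight without regex.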
@wzhao18 It turns out your notation is the right one for
Purpose
This PR adds support for selectively offloading parameters to the CPU based on name matching. One use case is to offload only the expert weights of MoE models, which is useful for low-concurrency settings. This is enabled by passing the argument --cpu-offload-params.

Test Plan
Tested offloading Kimi K2 NVFP4 on one GB300.
Benchmarking single-user throughput:
Test Result
Before: 15 tok/s
Offloading MoE weights only: 31 tok/s (~2.1x speedup)